Abstract

As  numerous  implementations  have  demonstrated,  software-based  parallel  rendering  is  an  effective  way 
to  obtain  the  needed  computational  power  for  a  variety  of  challenging  applications  in  computer  graphics 
and scientific visualization. To fully realize their potential, however, parallel renderers need to be
integrated  into  a  complete  environment  for  generating,  manipulating,  and  delivering  visual  data. 

We  examine  the  structure  and  components  of  such  an  environment,  including  the  programming  and 
user  interfaces,  rendering  engines,  and  image  delivery  systems.  We  consider  some  of  the  constraints 
imposed  by  real-world  applications  and  discuss  the  problems  and  issues  involved  in  bringing  parallel 
rendering  out  of  the  lab  and  into  production. 
1 Introduction


Synthesizing  high-quality  images  from  abstract  geometric  or  numerical  representations  of  a  scene  is  a 
computationally  demanding  task.  As  the  power  and  availability  of  general-purpose  parallel  computer 
systems  have  grown,  the  computer  graphics  community  has  become  increasingly  interested  in  exploiting 
them  to  support  sophisticated  rendering  methods  and  complex  scenes.  In  numerous  projects  over  the  last 
several  years,  all  of  the  common  rendering  techniques  (polygon,  volume,  and  terrain  rendering,  ray 
tracing, and radiosity methods) have been mapped onto parallel architectures, ranging from
tightly-coupled symmetric multiprocessors and data-parallel SIMD arrays to distributed-memory
message-passing systems and loosely-coupled networks of workstations. While the performance and
efficiency of these implementations have varied widely, there have been enough successes to conclude
that carefully-constructed parallel renderers can be an effective way to obtain the needed processing power for
demanding  applications  in  computer  graphics  and  data  visualization. 

Although  the  rendering  process  contains  ample  parallelism  at  several  different  levels,  the  issues 
involved in developing efficient parallel renderers are complex [8] [18], and most of the effort to date has
focused on the rendering algorithms themselves and their interactions with specific architectural
platforms. The question of integrating parallel renderers into the broader computing environment has
often been neglected, and in some cases explicitly ignored. The purpose of this paper is to examine the
role of parallel renderers within a broader, application-oriented context, and to discuss some of the issues
involved in moving parallel rendering from a research curiosity to a production tool. The focus is on
software-based renderers running on general-purpose parallel platforms; we do not consider
special-purpose architectures designed specifically to support rendering.

We  briefly  review  some  of  the  applications  for  which  software-based  parallel  rendering  is  and  is  not 
appropriate,  and  then  examine  the  overall  software  architecture  needed  to  support  parallel  rendering 
applications.  The  role  of  each  component  within  the  system  is  explored  in  some  detail,  emphasizing  the 
impact  that  each  has  upon  the  others.  We  conclude  with  some  thoughts  on  the  challenges  and 
opportunities  which  await  parallel  graphics  and  visualization  researchers  as  we  move  forward  from  the 
current  state-of-the-art. 

2  Applications  of  Software-Based  Parallel  Rendering 

Hardware-based  rendering  engines  for  workstation-class  systems  are  relatively  affordable  and  provide 
impressive  performance  which  continues  to  grow  at  a  dramatic  rate.  This  raises  an  obvious  question: 
Why bother with software-based renderers on complex architectures when off-the-shelf hardware
solutions  are  readily  available?  There  are  several  answers. 


While  dedicated  rendering  hardware  can  be  applied  to  many  problems  in  computer  graphics,  it  lacks 
the flexibility of software-based renderers. Hardware rendering engines usually provide direct support for
a  restricted  class  of  rendering  methods  (e.g.,  polygon  rendering),  lighting  models,  and  image  resolutions. 
Alternate  rendering  techniques  such  as  ray  tracing,  radiosity,  and  volume  rendering  often  run  at  much  less 
than  interactive  rates  on  these  systems,  making  them  obvious  candidates  for  software-based  parallel 
solutions. Software-based renderers can be easily modified to incorporate alternative techniques for
illumination, shading, interpolation, composition, etc. They can also support arbitrary classes of
geometric primitives, e.g., spheres and parametric surfaces. By exploiting parallelism, software-based
renderers may be able to regain some of the performance which they have sacrificed in favor of
flexibility. 

Since  many  parallel  rendering  algorithms  exploit  pixel-level  parallelism  by  partitioning  the  image 
across  multiple  processors,  they  are  especially  well-suited  for  graphics  applications  which  require  high 
resolution or large amounts of data at each pixel. By adding processors, more memory (as well
as  computing  power)  can  be  added  to  support  larger  images,  multiple  views,  supersampling,  transparency, 
etc. 

Visualization  and  graphics  applications  involving  very  large  datasets  are  also  candidates  for  parallel 
rendering.  Large-scale  scientific  applications,  particularly  those  involving  time-dependent  phenomena, 
can  generate  results  which  range  from  hundreds  of  megabytes  to  hundreds  of  gigabytes  in  size.  These 
datasets  may  be  too  large  to  process  effectively  with  anything  less  than  a  supercomputer-class  system,  or 
too  cumbersome  to  move  across  the  network  for  postprocessing  elsewhere.  In  such  cases  it  may  be  more 
practical  to  perform  the  visualization  and  graphics  operations  in  parallel  on  the  system  where  the  data 
originates,  transmitting  images,  rather  than  the  raw  data,  back  to  the  user. 

With appropriate interfaces, parallel renderers enable parallel application programs to produce live
visual  output  at  runtime.  This  visual  feedback  is  especially  helpful  with  large,  complex  problems,  where 
it  can  be  employed  for  debugging  or  to  monitor  the  progress  of  executing  jobs.  Visual  output  can  also 
play  an  important  role  in  interactive  steering  applications,  where  the  user  adjusts  the  execution  parameters 
at  runtime  to  explore  larger  design  spaces  or  to  reach  a  solution  more  quickly. 

2.1  Limitations 

For  the  majority  of  graphics  applications,  software-based  parallel  rendering  is  not  an  appropriate  choice. 
Workstation-class  systems  provide  more  than  enough  power  except  for  the  largest  problems  or  the  most 
expensive rendering methods. It also seems unlikely that software renderers running on general-purpose
parallel systems will ever be able to compete with commercial hardware rendering engines on a
price/performance basis. By tuning the architectural design for a constrained set of graphics operations,
hardware engines eliminate redundant components (such as power supplies, circuit boards, and I/O
subsystems), and provide dedicated high-speed data paths among the processing elements and to the
frame buffer.

In addition, applications requiring highly interactive response are probably not well-suited to
software-based rendering. While the computational demands of virtual reality and real-time simulation
(particularly for large datasets or complex scenes) make parallel processing an attractive option, these
applications have stringent requirements on frame rates (> 10 fps) and latencies (< 10 ms) which are
difficult to achieve with software solutions on general-purpose systems. At the very
least,  hardware  support  for  image  output  (e.g.,  [17])  appears  to  be  essential  in  providing  smooth 
interactive  operation  with  current  architectures.  This  situation  may  change  in  coming  years  as  the 
performance  of  processors  and  I/O  interfaces  continues  to  improve. 

3  Software  Architecture  for  Parallel  Rendering 

To become useful tools, parallel renderers must interface with other elements in a larger computing
environment.  Figure  1  shows  the  principal  software  components  in  a  typical  parallel  rendering 
application.  Although  not  explicitly  shown,  we  assume  that  most  if  not  all  of  these  will  call  upon  an 
underlying  operating  system  for  services  such  as  memory  and  process  management,  I/O,  and 
communications.  We  discuss  each  component  in  some  detail  in  the  following  sections. 

3.1  Application 

The  application  layer  defines  the  overall  task  to  be  accomplished.  For  assistance,  it  relies  upon  system 
and domain-specific libraries, and may invoke the renderer directly for low-level graphics operations. In
a purely graphical application, this layer may be simply a thin veneer over the renderer. In other cases,
such as numerical simulations or virtual reality, the renderer may be only a small piece in a much larger
puzzle. 

An  important  property  of  the  application  is  its  parallel  programming  paradigm,  which  in  turn  may  be 
heavily  influenced  by  the  architecture  on  which  it  resides.  For  example,  an  application  could  be  written 
in  a  loosely-coupled  MIMD  fashion,  in  which  different  components  are  working  on  significantly  different 
aspects  of  the  problem,  while  sharing  global  information  and  exchanging  intermediate  results.  Or  it  could 
be  written  in  a  tightly-coupled  SIMD  mode,  performing  fine-grained  parallel  operations  in  lock-step  on 
regular data structures. A popular programming style for scientific applications is SPMD (Single
Program Multiple Data), in which all of the processors execute the same program, following roughly the
same path through the code, with occasional synchronization and communication points.

Figure 1. Software architecture for a typical parallel rendering application.

The  programming  paradigm  adopted  by  an  application  has  a  major  influence  on  all  of  the  software 
layers  below  it,  which  must  be  able  to  interface  with  the  application  on  its  terms.  The  lower  level 
components  must  either  adopt  the  same  paradigm,  or  provide  appropriate  interfaces  between  the 
application’s  structure  and  their  own  internal  strategies. 

Another  crucial  property  of  the  application  is  its  memory  reference  model.  A  message-passing 
application  which  partitions  its  data  structures  across  distributed  memories  may  need  very  different 
library  interfaces  than  one  which  relies  on  global  access  to  shared  data. 

3.2  Libraries 

Most  applications  depend  on  a  variety  of  libraries  for  support  services  ranging  from  I/O  to  numerical 
solutions. In our context, we are interested in libraries which provide a higher-level interface to the
low-level operations supported by the renderer. For example, an application which computes values on a 3D
grid  may  prefer  to  invoke  isosurface  routines  rather  than  triangle-drawing  primitives.  Visualization 
techniques,  geometry  modeling,  and  user-interface  operations  are  among  the  likely  candidates  for 
implementation at the library level. As an intermediary between the application and the renderer, the
library layer also provides an opportunity to match the application’s programming paradigm and data
structures to those used by the renderer.

An  important  consideration  for  parallel  libraries,  and  for  all  of  the  software  layers  beneath  them,  is  the 
ability  to  run  with  acceptable  performance  in  a  multi-algorithm  environment.  Many  applications  are 
composed  of  several  different  computations,  or  phases,  each  requiring  different  memory  reference 
patterns  and  communication  topologies.  The  library  developer  must  assume  that  the  application 
programmer  or  end  user  will  map  processes  onto  processors  and  adjust  communication  parameters  to  suit 
the  application’s  needs.  The  communication  patterns,  task  structure,  and  synchronization  requirements  of 
the application may differ markedly from those of parallel visualization algorithms or of the renderer.
Thus  supporting  software  layers  should  not  depend  on  algorithms  which  require  particular  mappings  of 
processes  onto  processors  or  specific  physical  communication  topologies.  This  presents  a  challenge  for 
parallel  algorithm  developers,  since  many  existing  techniques  exploit  fixed  interconnection  topologies  to 
obtain  good  performance.  However,  the  development  of  robust,  topology-independent  algorithms  also 
aids  cross-platform  portability,  which  is  particularly  important  given  the  rapid  obsolescence  of  systems 
and  continuing  evolution  of  parallel  architectures. 
3.3 Renderer API


The renderer’s application programming interface (API) is a low-level library which provides access to
the graphical operations implemented by the renderer. The API (perhaps with help from the library layer)
is responsible for matching the application’s programming paradigm, memory access model, and data
structures to the underlying parallel algorithms employed by the renderer. A poorly-designed API will
inflict additional burdens upon the application and library layers, making the renderer difficult or
inconvenient  to  use. 

The design of the API has a fundamental impact on the design of the renderer. To ensure that the API
layer does not become a significant source of memory, communication, or computational overheads, the
renderer must directly and efficiently support the operations defined by the API. Thus ease-of-use and
efficiency considerations suggest that the API layer should be relatively thin, and that the renderer should
be  designed  to  accommodate  the  programming  paradigm  and  data  structuring  conventions  of  the 
prevailing  applications. 

3.4  Renderer 

Many  of  the  parallel  rendering  algorithms  and  implementations  reported  in  the  literature  have  assumed 
that  the  renderer  itself  is  the  driving  application.  In  this  scenario,  the  application  layer  needs  to  do  little 
more  than  read  in  a  scene  description,  arrange  it  in  memory  to  suit  the  rendering  algorithm,  and  dispose  of 
the  resulting  image.  Often  the  focus  has  been  strictly  on  the  renderer,  and  I/O  and  display  times  have 
been  ignored  in  reporting  the  results. 

This  narrow  view  of  the  rendering  application  has  far-reaching  algorithmic  design  consequences.  The 
renderer  is  free  to  assume  that  the  entire  resources  of  the  system  are  available  for  its  use,  including 
memory,  processing  power,  and  I/O.  Thus  it  is  common  to  read  about  rendering  algorithms  which 
assume  that  sufficient  memory  is  available  to  buffer  all  of  the  intermediate  results  at  various  steps  in  the 
rendering  pipeline,  or  to  replicate  the  entire  image  memory  on  every  processor. 

In  reality,  many  of  the  applications  which  reside  on  parallel  systems  have  very  demanding  resource 
requirements  of  their  own  (if  they  didn’t,  they  wouldn’t  need  to  be  on  a  parallel  computer  in  the  first 
place).  To  be  an  effective  tool  in  this  environment,  a  parallel  renderer  must  be  modest  in  its  own 
demands,  and  this  imposes  additional  constraints  on  algorithm  design  [5].  For  example,  a  polygon 
rendering  pipeline  must  actually  be  implemented  in  pipelined  fashion,  rather  than  as  a  series  of  sequential 
stages,  so  that  intermediate  results  can  be  consumed  as  they  are  generated.  Hence  an  application-friendly 
parallel renderer is likely to exploit both functional parallelism and data parallelism [8]. Likewise,
renderers which partition their image data structures can better accommodate memory-intensive
techniques  such  as  supersampled  antialiasing  and  transparency  with  less  impact  on  applications. 
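
The memory argument for a truly pipelined implementation can be illustrated with a minimal Python sketch. The stage names (transform, rasterize, z_buffer) and the toy primitive format are hypothetical, chosen only for illustration; they are not taken from any particular renderer. Each stage is a generator, so a fragment is consumed as soon as it is produced and no stage buffers its complete output.

```python
from typing import Iterable, Iterator, List, Tuple

Primitive = Tuple[int, int]        # toy primitive: (scanline, depth); illustrative only
Fragment = Tuple[int, int, int]    # (x, y, depth)

def transform(prims: Iterable[Primitive], dy: int) -> Iterator[Primitive]:
    """Geometry stage: shift each primitive vertically, one at a time."""
    for y, z in prims:
        yield (y + dy, z)

def rasterize(prims: Iterable[Primitive], width: int) -> Iterator[Fragment]:
    """Rasterization stage: emit one fragment per pixel on the primitive's scanline."""
    for y, z in prims:
        for x in range(width):
            yield (x, y, z)

def z_buffer(frags: Iterable[Fragment], width: int, height: int) -> List[list]:
    """Final stage: keep the nearest depth seen at each pixel."""
    depth = [[float("inf")] * width for _ in range(height)]
    for x, y, z in frags:
        if 0 <= y < height and z < depth[y][x]:
            depth[y][x] = z
    return depth

def render(prims: Iterable[Primitive], width: int, height: int, dy: int = 0) -> List[list]:
    # Chaining generators means only one primitive and one fragment are in flight
    # at any time; only the depth buffer itself is stored in full.
    return z_buffer(rasterize(transform(prims, dy), width), width, height)
```

Had each stage returned a complete list instead, peak memory would grow with the total number of fragments rather than with the size of the image.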
3.5 Image Assembly

Since  much  of  the  parallelism  available  in  the  rendering  process  occurs  at  the  pixel  level,  most  parallel 
renderers try to distribute screen-space computations across the available processors. On
distributed-memory architectures at least, this implies that a series of partial images must be assembled or
composited  to  produce  the  final  result.  The  details  of  this  process  depend  upon  the  structure  of  the 
rendering  algorithms,  the  intended  destination  of  the  completed  images,  and  the  hardware  and  software 
architecture  of  the  underlying  system.  We  find  it  useful  to  distinguish  between  two  cases,  internal 
assembly  and  external  assembly  [8]. 

With  internal  assembly,  the  final  image  is  formed  in  its  entirety  somewhere  within  the  memory  of  the 
parallel  system,  and  is  then  routed  to  its  destination,  which  may  be  a  display  device,  a  file,  or  a  remote 
workstation.  Internal  assembly  implies  that  sufficient  memory  must  be  allocated  in  one  place  to 
accommodate  the  full  image.  If  every  processor  allocates  this  space,  then  memory  consumption  may  be 
excessive.  If  only  one  processor  allocates  space  for  the  image,  then  memory  consumption  will  not  be 
uniform  across  processors,  leading  to  potential  complications  for  memory-hungry  applications.  One 
alternative  is  to  single  out  a  processor  to  be  responsible  for  image  assembly,  and  to  offload  other  tasks 
from  it,  thereby  bringing  its  resource  requirements  more  in  line  with  those  of  the  other  processors. 
Another  alternative  is  to  use  an  auxiliary  processor,  such  as  a  service  or  I/O  node,  to  perform  this 
function.  In  either  case,  an  element  of  heterogeneity  is  introduced  into  the  software  environment,  which 
often  translates  into  additional  complexity  for  the  application  designer  or  end  user.  For  these  reasons, 
external  image  assembly  may  be  preferable. 

With  external  assembly,  the  components  which  make  up  an  image  are  routed  to  a  remote  location 
(typically  their  final  destination)  before  being  combined  into  a  whole.  For  example,  different  segments  of 
an  image  may  be  sent  from  each  processor  to  an  addressable  frame  buffer,  in  which  case  the  complete 
image  is  formed  only  within  the  frame  buffer’s  memory  system.  Or,  image  components  may  be  written 
in  encoded  form  to  a  file,  in  which  case  assembly  occurs  only  when  the  file  is  decoded  for  playback. 
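
As a rough illustration of external assembly, the following Python sketch has each processor write the scanlines it owns directly to an addressable frame buffer, so that the complete image exists only in the frame buffer's memory. The AddressableFrameBuffer class, the interleaved scanline assignment, and the placeholder render_row function are all assumptions made for this example, and a sequential loop stands in for the parallel processes.

```python
class AddressableFrameBuffer:
    """Stand-in for a remote, addressable frame buffer (hypothetical interface)."""

    def __init__(self, width: int, height: int):
        self.width, self.height = width, height
        self.pixels = bytearray(width * height)   # one byte per pixel

    def write_rows(self, first_row: int, data: bytes) -> None:
        """Store a run of pixel data starting at the given scanline."""
        start = first_row * self.width
        self.pixels[start:start + len(data)] = data

def owned_rows(rank: int, nprocs: int, height: int) -> range:
    """Interleaved scanline assignment: rank, rank + nprocs, rank + 2*nprocs, ..."""
    return range(rank, height, nprocs)

def render_row(rank: int, row: int, width: int) -> bytes:
    """Placeholder renderer: fills the row with the owning processor's rank."""
    return bytes([rank]) * width

def external_assembly(nprocs: int, width: int, height: int) -> AddressableFrameBuffer:
    fb = AddressableFrameBuffer(width, height)
    for rank in range(nprocs):                    # stands in for nprocs parallel processes
        for row in owned_rows(rank, nprocs, height):
            # Each processor holds at most one scanline at a time, never the full image.
            fb.write_rows(row, render_row(rank, row, width))
    return fb
```

In this arrangement the per-processor memory cost is a single scanline, regardless of image size or processor count.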

With  either  internal  or  external  assembly,  image  components  must  be  retrieved  efficiently  from 
multiple processors and merged into an output stream. To sustain interactive or animation rates for
full-screen displays, bandwidths on the order of 100 MB/s may be required. While the internal
communication  networks  of  current-generation  systems  may  be  able  to  support  these  data  rates,  software 
overheads  introduce  additional  delays,  particularly  when  large  numbers  of  processors  need  to  combine 
their  results  into  a  single  output  stream.  Merging  algorithms  must  be  carefully  designed  to  exploit 
parallelism  in  order  to  provide  scalability  and  to  reduce  the  impact  of  serial  bottlenecks. 
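
One common way to avoid a serial bottleneck when merging, shown here only as an illustrative possibility and not as the method of any system cited in this paper, is to combine partial images pairwise in a tree, so that P partial images are merged in roughly log2(P) rounds. The sketch assumes each processor holds a full-size partial image of (depth, color) pixels and that the list of partial images is non-empty.

```python
def composite(a, b):
    """Depth-composite two partial images given as equal-length lists of (depth, color)."""
    return [pa if pa[0] <= pb[0] else pb for pa, pb in zip(a, b)]

def tree_merge(partials):
    """Merge partial images pairwise; the merges within one round are independent
    of each other and could therefore run in parallel on different processors."""
    rounds = 0
    while len(partials) > 1:
        merged = [composite(partials[i], partials[i + 1])
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:            # odd one out passes through to the next round
            merged.append(partials[-1])
        partials = merged
        rounds += 1
    return partials[0], rounds
```

A linear merge through a single processor would take P - 1 sequential steps; the tree reduces the number of sequential rounds at the cost of more total communication.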
3.6 Image Transport, Display, and Storage

Because  of  the  large  volume  of  data  involved,  getting  image  streams  out  of  a  parallel  system  and  onto  a 
display  or  storage  device  is  an  important  consideration.  The  problem  is  easiest  to  address  at  the  hardware 
level,  where  devices  such  as  HIPPI  frame  buffers  provide  high-bandwidth  interfaces  to  video  displays. 
To  provide  additional  parallelism  in  the  image  assembly  and  display  process,  several  systems  (including 
the CM-2 and nCUBE) have incorporated multi-ported frame buffers [2] [17]. The renderer must then
assemble  the  image  data  into  several  partial  streams  (typically  one  per  port)  with  appropriate  global 
synchronization. 

3.6.1  Remote  image  display 

Although  directly-attached  display  devices  offer  the  best  performance,  they  suffer  from  a  number  of 
drawbacks  for  parallel  rendering  applications.  For  one  thing,  they  are  a  single-user  resource,  even  though 
most  parallel  systems  support  the  execution  of  multiple  jobs  concurrently.  Perhaps  more  importantly, 
large-scale  parallel  systems  are  scarce  commodities,  typically  serving  a  large  and  geographically 
dispersed  user  community.  For  the  majority  of  users,  access  to  a  directly-connected  display  device  will 
be  either  inconvenient  or  infeasible.  What  is  needed  is  a  way  to  transmit  the  rendered  images  to  the 
user’s  desktop  at  rates  which  will  support  interaction  and  avoid  excessive  I/O  delays  for  the  application. 
Given  the  bandwidth  requirements  outlined  above,  and  the  typical  performance  of  congested  long-haul 
networks,  this  is  a  challenging  problem  indeed.  However,  for  local  area  networks  the  problem  is  more 
manageable,  and  we  have  had  some  success  in  delivering  image  streams  from  parallel  systems  to  desktop 
workstations  [5]  [7]. 

Existing networks at most sites provide peak bandwidths on the order of 1-10 MB/s using Ethernet,
Fast  Ethernet,  FDDI,  or  similar  technologies.  Sluggish  network  interfaces,  software  overheads,  and 
contention  with  other  traffic  can  easily  reduce  this  by  30-75%.  Since  this  is  far  short  of  the  100  MB/s 
needed  for  animation,  some  compromises  are  clearly  in  order.  Order-of-magnitude  reductions  in 
bandwidth  requirements  can  be  achieved  by  resorting  to  lower-resolution  images  (640x512  vs. 
1280  x  1024)  and  reduced  color  precision  (8  bits  vs.  24).  In  some  applications,  lower  frame  rates  (1-10 
fps)  may  also  be  acceptable,  reducing  bandwidth  demands  even  further.  By  combining  these  strategies,  it 
is  possible  to  reach  a  level  at  which  Ethernet  is  a  tolerable,  if  not  wholly  satisfactory,  medium  for 
delivering output streams from parallel renderers.
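
The arithmetic behind these figures can be checked with a short Python calculation. The 30 fps animation rate is an assumption made here for illustration (the text gives only the 100 MB/s total), and 1 MB is taken as 10**6 bytes.

```python
def stream_bandwidth_mb_s(width: int, height: int, bits_per_pixel: int, fps: float) -> float:
    """Uncompressed image-stream bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return width * height * bits_per_pixel / 8 * fps / 1e6

full = stream_bandwidth_mb_s(1280, 1024, 24, 30)    # about 118 MB/s
reduced = stream_bandwidth_mb_s(640, 512, 8, 30)    # about 9.8 MB/s, a 12x reduction
slow = stream_bandwidth_mb_s(640, 512, 8, 10)       # about 3.3 MB/s
```

Halving the resolution in each dimension saves a factor of 4 and dropping from 24 to 8 bits saves a factor of 3, which together give the order-of-magnitude reduction; lowering the frame rate reduces the requirement proportionally.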

3.6.2  Image  compression 

For  some  applications  these  compromises  are  not  acceptable,  and  in  other  cases  the  available  network 
bandwidth  may  still  be  insufficient.  To  accommodate  these  situations,  we  must  resort  to  data 
compression techniques, perhaps in combination with some of the other strategies. Although many
methods are available for compressing images and video [3] [4], with new developments appearing almost
daily,  little  if  any  of  this  work  has  been  done  with  parallel  rendering  applications  in  mind. 

The  parallel  rendering  environment  imposes  some  additional  requirements  and  opportunities  which  are 
not  present  in  most  image  compression  applications.  We  have  already  noted  that  the  image  data  is  likely 
to  be  partitioned  among  multiple  processors.  This  raises  the  possibility  that  the  compression  phase  can  be 
performed  in  parallel  on  different  segments  of  the  image.  As  the  number  of  processors  grows,  so  does  the 
available  parallelism,  but  the  amount  of  image  data  per  processor  decreases,  reducing  the  length  of  the 
input  string.  To  complicate  matters  further,  load  balancing  considerations  often  favor  decompositions 
which  scatter  the  image  data  across  processors,  thereby  avoiding  hotspots  due  to  local  variations  in  image 
complexity  [8].  Unfortunately,  this  also  limits  the  ability  of  compression  algorithms  to  exploit  spatial 
coherence.  On  the  other  hand,  the  image  streams  generated  in  many  parallel  rendering  applications  do 
not  vary  radically  from  one  frame  to  the  next,  so  the  opportunity  exists  to  exploit  temporal  coherence.  On 
the  receiving  end,  we  expect  to  have  a  PC-  or  workstation-class  system  available  to  perform  the 
decompression,  typically  employing  a  single  processor.  Thus  the  computing  power  available  to 
decompress  the  data  may  be  one  to  two  orders  of  magnitude  less  than  that  available  for  compressing  it. 

With  these  considerations  in  mind,  an  ideal  compression  scheme  for  parallel  rendering  would: 

•  compress  with  reasonable  speed, 

•  parallelize  well,  with  minimal  interprocessor  communication, 

•  exhibit  good  compression  with  relatively  short  input  strings, 

•  accept  arbitrary  orderings  of  the  input  data, 

•  exploit  temporal  coherence,  and 

•  decompress  very  rapidly. 

For  computer  graphics  applications  in  which  image  quality  is  paramount,  or  for  visualization  applications 
in  which  accuracy  is  essential,  we  also  insist  on  lossless  compression  schemes.  In  other  cases, 
particularly  when  bandwidth  is  low,  lossy  methods  may  be  acceptable,  or  even  required.  We  are  aware  of 
very  little  work  that  has  been  done  on  real-time  image  compression  and  transmission  methods  for  parallel 
rendering,  making  this  a  fruitful  area  for  future  research.  Some  simple  techniques  designed  for  use  on 
local  area  networks  are  described  in  [5]  and  [9]. 
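
To make several of these requirements concrete, here is a minimal lossless sketch in Python. It is not one of the techniques described in [5] or [9]; the XOR delta and the use of the standard zlib module are illustrative choices of our own. Each processor encodes its own image segment as a difference from the same segment in the previous frame, so unchanged pixels become zeros (exploiting temporal coherence) and no interprocessor communication is needed. The two segments are assumed to have equal length.

```python
import zlib

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def compress_segment(current: bytes, previous: bytes, level: int = 1) -> bytes:
    """Encode one processor's segment as a compressed delta from the prior frame.
    A low zlib level favors compression speed over compression ratio."""
    return zlib.compress(xor_bytes(current, previous), level)

def decompress_segment(payload: bytes, previous: bytes) -> bytes:
    """Recover the current segment from the payload and the prior frame's segment."""
    return xor_bytes(zlib.decompress(payload), previous)
```

Because each segment is handled independently, the scheme tolerates arbitrary orderings of the segments, although very short segments will still compress less effectively than long ones.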

Although  network-based  image  transmission  lacks  the  responsiveness  of  graphics  workstations,  we  are 
optimistic  about  the  prospects  for  the  future.  All  of  the  components  involved  (processors,  network 
interfaces,  network  infrastructure,  real-time  protocols,  compression  algorithms,  etc.)  are  improving,  and 
we  expect  that  remote  graphical  interaction  with  parallel  applications  will  be  quite  practical  in  a  few 
years. 
3.6.3 Image storage

In  some  applications,  real-time  interaction  is  not  a  primary  concern.  For  example,  sophisticated 
photorealistic  rendering  techniques  may  take  many  minutes  to  generate  an  image,  even  with  parallel 
methods.  Large  scientific  applications  may  also  take  minutes  or  hours  to  compute  a  single  iteration  or 
timestep,  and  the  realities  of  supercomputer  scheduling  may  force  large  jobs  to  run  in  batch  queues  with 
unpredictable  and  inconvenient  starting  times.  In  these  situations,  it  is  more  appropriate  to  route  the 
image  stream  to  secondary  storage  for  later  perusal.  For  interactive  applications,  it  may  be  desirable  to 
save  a  copy  of  the  visual  output  for  future  reference,  or  to  take  snapshots  of  particularly  interesting 
frames. 

Even  with  compression,  long  animation  sequences  can  become  quite  large,  particularly  with  high 
quality  images.  A  long  simulation  could  reasonably  generate  several  gigabytes  of  image  data. 
Particularly  for  batch  applications,  we  may  be  able  to  tolerate  increased  compression  and  (possibly) 
decompression  times  in  exchange  for  more  compact  output  files.  If  the  image  files  can  be  structured 
appropriately,  it  may  also  be  possible  to  take  advantage  of  parallel  I/O  operations  to  reduce  write  times. 
This  is  in  contrast  to  networked  transmission,  where  the  image  data  typically  needs  to  be  serialized  and 
written  to  a  single  socket  descriptor. 

These  differences  suggest  that  the  image  assembly  algorithms,  compression  methods,  and  data  formats 
for  file  storage  may  need  to  be  different  than  those  for  direct  display  or  remote  transmission.  Thus  our 
parallel  rendering  systems  should  be  designed  in  a  flexible  manner  which  will  accommodate  multiple 
output  strategies. 

It should be apparent from the foregoing discussion that image handling considerations can impact the
design of the renderer itself. For example, the optimum image partitioning strategy from a rendering
standpoint  could  lead  to  extra  communication  in  the  image  assembly  phase,  or  reduced  effectiveness  of 
the  compression  algorithms.  Thus  it  is  essential  to  bear  in  mind  the  overall  performance  of  the  system 
and  the  feedback  relationships  between  each  of  the  components. 

3.7  User  Interface 

The  ability  to  interact  with  a  scene  and  its  contents  is  essential  in  many  computer  graphics  applications. 
Static  views  which  are  determined  a  priori  may  be  necessary  for  batch-mode  applications,  but  invariably 
there  are  features  of  interest  which  are  obscured  or  are  difficult  to  interpret  properly  with  a  fixed 
viewpoint and static lighting. Providing the user with the ability to modify what is being viewed greatly
enhances  the  power  of  computer  graphics.  This  is  especially  true  for  parallel  rendering  applications, 
which  often  involve  large,  complex,  and  dynamic  scenes. 
The user interface should be tightly-coupled to the display software, since the most effective
interaction  mechanisms  are  often  those  which  involve  direct  manipulation  of  the  imagery.  With  directly- 
attached  hardware  displays,  the  user  interface  software  will  need  to  be  integrated  into  the  parallel 
environment,  perhaps  residing  on  a  single  processor  which  is  dedicated  to  that  task.  With  remote 
displays,  it  is  more  appropriate,  and  generally  more  convenient,  to  host  the  user  interface  code  on  the 
receiving  workstation,  along  with  the  display  code.  At  a  minimum,  the  user  interface  component  is 
responsible for handling device events and communicating them back to the application. 2-D GUIs
are probably best implemented on the workstation side where they can take advantage of
existing  libraries  and  operating  system  support.  3-D  GUIs,  in  which  interface  objects  appear  in  the  scene, 
may  need  to  be  implemented  on  the  rendering  side.  This  poses  some  interesting  issues  in  modeling  and 
interacting  with  interface  elements  which,  for  load  balancing  and  other  reasons,  may  need  to  be 
distributed  across  multiple  processors. 

User  interaction  requests  can  occur  at  several  different  levels  in  the  software  hierarchy.  For  example, 
changes in viewing parameters and light sources will generally be handled through the renderer’s API,
while  manipulation  of  geometric  models  and  visualization  parameters  might  be  implemented  within  the 
library  layer.  For  purposes  of  debugging  and  interactive  steering,  the  user  interface  may  communicate 
directly  with  the  application  in  order  to  interrogate  and  modify  its  variables  and  data  structures. 

As  we  pointed  out  in  Section  2.1,  highly  responsive  interaction  is  difficult  to  achieve  with  software- 
based  solutions,  and  placing  the  display  and  user  interface  on  a  remote  desktop  only  exacerbates  the 
problem.  This  implies  that  interaction  mechanisms  should  be  designed  to  cope  with  high  latency  and 
sluggish  image  updates.  To  prevent  the  user  from  getting  too  far  ahead  of  the  application,  event  streams 
may  need  to  be  collapsed  in  order  to  avoid  the  overhead  of  rendering  and  transmitting  images  that  will  be 
seriously  out  of  sync  by  the  time  they  arrive. 
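
As an illustration of event collapsing, the following minimal sketch merges runs of consecutive events of the same kind into a single event before they are forwarded to the application. The event kinds, the `Event` fields, and the assumption that motion deltas simply add are ours, chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str        # hypothetical event names, e.g. "rotate", "pan", "pick"
    dx: float = 0.0
    dy: float = 0.0


# Kinds whose effects are assumed to accumulate, so a run can be merged.
MERGEABLE = {"rotate", "pan", "zoom"}


def collapse(events):
    """Merge consecutive events of the same mergeable kind into one event."""
    out = []
    for ev in events:
        if out and ev.kind in MERGEABLE and out[-1].kind == ev.kind:
            last = out[-1]
            out[-1] = Event(ev.kind, last.dx + ev.dx, last.dy + ev.dy)
        else:
            out.append(ev)   # non-mergeable events (e.g. picks) are kept as-is
    return out
```

With this scheme, a burst of mouse-drag events queued while a frame is in transit produces one rendering request rather than one per event, while discrete events such as picks are preserved in order.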

Another  interesting  situation  arises  in  dealing  with  applications  which  require  lengthy  computations  in 
order  to  generate  a  new  image.  For  example,  it  may  be  natural  for  a  numerical  simulation  to  invoke  the 
renderer at the end of an iteration or a time step to display the current state of the computation. If the
time between updates exceeds a few seconds, interactivity is effectively lost. There are two principal
options for dealing with this situation. One possibility is for the renderer to run as a different process (or
at least a different thread), either sharing processors with the application, or perhaps running in its own
dedicated pool of processors. The renderer can then respond to user interface requests asynchronously
with respect to the application. Unfortunately, the application’s data structures are likely to be in an
inconsistent state during the midst of an iteration, forcing the renderer to maintain its own copy of
scene-related data. This copy must then be updated periodically in coordination with the application.
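
The first option can be sketched as follows, assuming a thread-based arrangement in which the renderer holds a private snapshot of the scene data. The class and method names (`SceneSnapshot`, `publish`, `read`) are hypothetical, and deep copying stands in for whatever update mechanism a real system would use.

```python
import copy
import threading


class SceneSnapshot:
    """Renderer-side copy of scene data, refreshed only at consistent points."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._data = copy.deepcopy(initial)
        self.version = 0

    def publish(self, app_data):
        # Called by the application at the end of an iteration,
        # when its data structures are in a consistent state.
        fresh = copy.deepcopy(app_data)
        with self._lock:
            self._data = fresh
            self.version += 1

    def read(self):
        # Called by the renderer thread whenever the user requests a new view.
        # The returned data is never modified in place after publication.
        with self._lock:
            return self.version, self._data
```

The renderer can call `read` at any time, even while the application is partway through an iteration, and will always see the most recently published consistent state. The cost is the extra memory and copying noted above.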

An  alternate  strategy,  which  avoids  data  copying  and  separate  processes,  is  to  have  the  user  indicate 
when  he  wants  to  interact  with  the  application.  The  application  pauses  once  it  reaches  a  consistent  state, 
switching control to the renderer which can then operate on stable data using the full processing power of
the system. When the user finishes an interaction sequence, he directs the application to resume its
operation.  Since  neither  of  these  strategies  is  entirely  satisfactory,  the  choice  should  be  made  depending 
on  the  application’s  requirements  and  user  preferences. 
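
A minimal sketch of the second strategy is shown below. The function and parameter names are hypothetical; we assume the user interface sets the `interact` flag when the user wants control and the `resume` flag when he is finished, and that `render_fn` blocks until the next user event.

```python
import threading


def run_simulation(state, steps, step_fn, render_fn, interact, resume):
    """Advance the simulation, yielding to the renderer between steps on request."""
    for _ in range(steps):
        step_fn(state)                 # state may be inconsistent inside this call
        if interact.is_set():          # the user asked to interact during the step
            interact.clear()
            # The step is complete, so the data is consistent and the renderer
            # can operate on it directly, with no private copy.
            while not resume.is_set():
                render_fn(state)       # assumed to block until the next user event
            resume.clear()
    return state


# The two flags would be shared with the user interface code.
interact, resume = threading.Event(), threading.Event()
```

Note that the delay before the renderer gains control can be as long as one full iteration, which is the main drawback of this approach.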

Designing a parallel renderer is a complex process which requires careful balancing of competing
requirements  and  numerous  tradeoffs.  Unfortunately,  adopting  a  system-level  view  of  the  process  only 
complicates  matters  by  introducing  additional  considerations  and  imposing  new  constraints.  We  have 
identified  several  of  these  issues  in  the  above  discussion,  but  there  are  no  doubt  others  which  we  have 
neglected.  Finding  acceptable  compromises  will  be  the  key  to  developing  parallel  rendering  systems 
which  become  useful  components  in  the  parallel  computing  toolbox. 

4  Challenges  and  Opportunities  for  Parallel  Rendering  Systems 

We  now  turn  our  attention  to  several  areas  which  we  think  will  present  both  challenges  and  opportunities 
for  parallel  rendering  research  in  the  next  few  years.  Of  particular  interest  are  issues  of  portability, 
scalability,  and  ease-of-use. 

4.1  Towards  Portability 

Portability  is  a  major  concern  in  the  development  of  full-featured  systems  to  support  parallel  graphics  and 
visualization  applications.  The  amount  of  code  required  to  supply  the  needed  capabilities  is  too  large  to 
reimplement  each  time  a  new  architecture  comes  along.  On  parallel  systems,  portability  implies  much 
more  than  just  having  the  code  compile  cleanly  on  a  new  system.  Performance  considerations  often  result 
in  algorithms  which  have  architectural  assumptions  deeply  ingrained  within  them.  Transferring  the  code 
to  a  different  platform  may  require  substantial  re-engineering,  or  perhaps  even  reformulation  of  the 
problem. 

4.1.1  Programming  paradigms 

One  approach  to  portability  is  to  adopt  a  lowest-common-denominator  programming  paradigm.  A 
potential  candidate  is  the  data-parallel  programming  model,  which  has  proven  itself  to  be  useful  on  a 
variety  of  architectures.  Although  many  of  the  concepts  originated  in  the  SIMD  world,  data-parallel 
algorithms  can  often  be  implemented  at  different  granularities  to  suit  the  characteristics  of  the  target 
architecture, and numerous parallel renderers have been implemented using this paradigm.

However,  the  rendering  process  involves  many  irregularities  [8]  which  lend  themselves  more  naturally 
to  SPMD  or  MIMD  implementations,  and  data-parallel  rendering  algorithms  have  achieved  their  success 
primarily  on  SIMD  architectures  with  low  communication  overheads.  Both  SPMD  and  SIMD 
programming models are common in scientific and engineering applications, although the SPMD
paradigm appears to offer more flexibility for computations involving complex grids and irregular data
structures. Despite the advantages of a unified programming model, it is questionable whether portability
considerations will outweigh the need to adapt the renderer to the prevailing paradigm on a given
architecture. 

4.1.2  Access  to  memory 

Another  impediment  to  portability  is  the  diversity  of  memory  access  models.  At  one  extreme  is  global 
shared  memory,  typically  found  on  symmetric  multiprocessors.  At  the  other  is  message-passing,  used  by 
many  distributed-memory  architectures  as  well  as  networks  of  workstations.  It  is  generally  agreed  that 
message-passing  programs  are  more  complex  and  require  more  lines  of  code  than  their  shared-memory 
counterparts.  Message-passing  also  involves  considerable  software  overheads,  even  with  assistance  from 
dedicated  co-processors.  Several  recent  parallel  architectures  provide  hardware  support  for  global 
addressing  of  physically  distributed  memory.  The  resulting  reductions  in  communication  overhead 
enable  algorithmic  approaches  which  are  not  practical  in  message-passing  environments,  and  this 
advantage  is  apparent  in  parallel  rendering  applications  [19]. 

Does  this  mean  message  passing  is  dead?  Not  necessarily.  Locality  is  still  important,  and  message 
passing  makes  it  apparent  and  provides  explicit  control  [15].  For  portability  purposes,  message  passing 
also  serves  as  a  lowest  common  denominator,  since  it  can  be  implemented  with  relative  ease  in  shared 
memory  environments. 
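
To illustrate how little is needed to layer message passing on top of shared memory, the sketch below builds blocking send and receive operations from in-memory queues, with threads standing in for processors. The `Mailboxes` class and the ring exchange in `worker` are hypothetical examples, not an interface from any existing library.

```python
import queue
import threading


class Mailboxes:
    """Point-to-point send/recv between ranks, built on shared in-memory queues."""

    def __init__(self, nprocs):
        self._boxes = [queue.Queue() for _ in range(nprocs)]

    def send(self, dest, payload):
        self._boxes[dest].put(payload)

    def recv(self, rank):
        return self._boxes[rank].get()   # blocks until a message arrives


def worker(rank, nprocs, comm, results):
    # Each rank sends a value to its right-hand neighbour in a ring,
    # then waits for the value sent by its left-hand neighbour.
    comm.send((rank + 1) % nprocs, rank * 10)
    results[rank] = comm.recv(rank)


if __name__ == "__main__":
    n = 4
    comm = Mailboxes(n)
    results = [None] * n
    threads = [threading.Thread(target=worker, args=(r, n, comm, results))
               for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(results)
```

The reverse mapping, emulating shared memory on a message-passing system, is considerably harder to do efficiently, which is why message passing serves as the common denominator.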

4.1.3  Parallel  programming  languages 

An  alternate  avenue  to  portability  is  the  use  of  high-level  parallel  programming  languages.  Parallel 
languages  provide  a  common  programming  paradigm  and  memory  access  model  across  architectures  by 
shifting  (some  of)  the  burden  of  architecture-specific  adaptations  to  the  compiler.  The  most  widely 
accepted  example  is  High  Performance  Fortran  (HPF)  [12],  which  is  now  supported  by  a  number  of 
vendors.  The  initial  version  of  the  language  was  intended  primarily  for  data-parallel  applications,  but 
current  developments  are  intended  to  broaden  its  applicability  to  irregular  problems.  Similar  efforts 
involving  parallel  derivatives  of  C++  may  be  more  suitable  for  computer  graphics  applications,  which 
have  traditionally  been  written  in  C  and  C++  to  take  advantage  of  more  sophisticated  data  structures. 
Nonetheless,  the  availability  of  HPF  on  an  assortment  of  architectures  presents  an  opportunity  to  assess 
the  use  of  parallel  languages  to  enhance  the  portability  of  rendering  applications. 

4.2  Parallel  Rendering  on  Teraflops  and  Petaflops  Architectures 

Emerging  supercomputer  architectures  will  require  higher  levels  of  parallelism  and  offer  higher 
performance  than  anything  previously  encountered.  The  first  general-purpose  teraflops  systems 
(procured for the U.S. Department of Energy’s Accelerated Strategic Computing Initiative [1]) will
employ on the order of 10^3-10^4 high-performance commercial microprocessors, in relatively conventional
architectural  arrangements.  To  date,  most  parallel  rendering  implementations  on  similar  architectures 
have  not  scaled  well  beyond  100-200  processors  [6][10][11][13].  We  conjecture  that  this  “scalability 
barrier”  arises  from  inherent  properties  of  the  rendering  problem. 

First  of  all,  typical  rendering  applications  are  limited  to  image  resolutions  of  about  1  megapixel  or  less. 
Given  that  the  computations  performed  at  a  single  pixel  represent  a  relatively  fine-grained  task,  we  must 
aggregate  many  of  them  to  obtain  efficient  performance.  As  we  increase  the  number  of  processors,  the 
image  size  usually  stays  fixed,  so  task  granularity  decreases.  Since  a  significant  portion  of  the  available 
parallelism  occurs  at  the  pixel  level,  Amdahl’s  law  implies  that  a  fixed  image  resolution  will  limit  the 
ultimate scalability of parallel renderers.
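
The loss of granularity is easy to quantify. The sketch below computes the average number of pixels per processor for a fixed image size, assuming (for illustration only) a 1024 x 1024 image divided evenly among the processors.

```python
def pixels_per_processor(width, height, nprocs):
    """Average image-space task size when pixels are divided evenly."""
    return (width * height) / nprocs


if __name__ == "__main__":
    for p in (100, 1_000, 10_000, 1_000_000):
        print(p, pixels_per_processor(1024, 1024, p))
```

At 100 processors each one is responsible for roughly ten thousand pixels, but at a million processors the share falls to about one pixel each, far too little work to amortize per-task overheads.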

To  compound  the  problem,  rendering  is  inherently  communication-intensive.  The  projection  from 
object-space  to  image-space  results  in  communication  graphs  which  are  dense,  scene-  and  view- 
dependent,  and  dynamic  (due  to  user  interaction  and  changing  scene  content).  As  the  number  of 
processors  increases,  so  do  the  number  of  communication  operations  required  at  each  processor,  while  the 
amount  of  data  transferred  with  each  operation  decreases.  Although  store-and-forward  aggregation 
schemes can improve performance by reducing the communication complexity [10][11], they merely
substitute  less  expensive  data  copying  for  more  expensive  data  communication,  so  the  inherent  overheads 
persist,  although  with  reduced  severity. 
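
The effect of aggregation on message counts can be illustrated with a generic two-stage scheme, which is not necessarily the one used in [10] or [11]. We assume the processors form a square logical grid and that, in the worst case, every processor has data for every other processor. Data is first sent along rows and then forwarded along columns.

```python
import math


def direct_messages(p):
    """Messages sent per processor when each one talks to all others directly."""
    return p - 1


def two_stage_messages(p):
    """Messages sent per processor with row-then-column forwarding."""
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("this sketch assumes a square processor grid")
    return 2 * (side - 1)
```

For 1024 processors this reduces the worst-case count from 1023 messages per processor to 62, but every data item that is forwarded must be copied at the intermediate processor, which is the residual overhead referred to above.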

We  conclude  from  this  that  parallel  rendering  algorithms  have  inherent  limits  on  scalability  which 
depend  more  on  workstation  display  technology  than  on  the  complexity  of  the  scenes  being  rendered. 
Whether  these  limits  will  come  into  play  in  a  given  application  depends  in  part  on  architectural 
parameters  (such  as  communication  overheads  and  the  number  of  processors),  and  in  part  on 
characteristics  of  the  application  (such  as  the  percentage  of  computation  due  to  screen-space  operations). 

What  are  the  implications  of  this  for  parallel  rendering  on  future  architectures?  Current  projections  are 
that petaflops-class systems will require from 10^5 to 10^6 parallel threads to keep them busy [16]. Unless
display  resolutions  increase  dramatically  over  the  next  ten  years,  we  will  have  image-space  tasks 
consisting  of  at  most  a  few  pixels.  Will  communication  systems  improve  to  the  point  that  we  can 
effectively  partition  our  images  at  this  level  of  granularity?  The  growing  gap  between  processor  speeds 
and  memory  access  times  suggests  that  this  won’t  be  the  case. 

Does  this  mean  that  software-based  parallel  rendering  will  not  be  a  viable  option,  and  that  we  should 
turn  instead  to  specialized  rendering  engines  which  are  closely  coupled  to  large  parallel  systems?  Or  will 
we  simply  run  the  rendering  operations  on  a  subset  of  the  available  processors,  letting  the  others  go  idle? 
Should we even care about processor utilization, as long as frame rates are high enough to support
real-time interaction?

On  the  other  hand,  the  ability  to  perform  a  trillion  or  quadrillion  operations  per  second  may  change  the 
way we think about rendering. Perhaps volume rendering, global illumination methods, procedural
textures, 36-bit color, 16x supersampling, and HDTV resolution will all become routine. What new
rendering  and  visualization  methods  will  this  massive  computing  power  enable,  and  how  will  they  benefit 
the  large-scale  scientific  and  engineering  applications  which  will  run  on  these  platforms?  Finding  the 
answers  to  these  and  other  questions  will  be  the  objective  of  many  new  explorations  in  parallel  rendering 
during  the  next  decade. 

4.3  APIs  for  Parallel  Rendering 

At  the  present  time,  there  is  no  standard  API  for  parallel  rendering,  and  only  a  handful  of  parallel 
renderers provide library interfaces for application programs. To gain wide acceptance, a standard library
interface  for  parallel  graphics  needs  to  be  developed,  in  much  the  same  way  that  MPI  [14]  has  unified 
disparate  message-passing  implementations.  Whether  a  single  graphics  API  can  support  a  sufficiently 
wide  range  of  parallel  applications  is  an  open  question. 

Portability  between  sequential  and  parallel  platforms  is  also  desirable,  and  this  argues  for  extending  or 
adapting  existing  graphics  APIs.  OpenGL  is  clearly  the  de  facto  standard  for  serial  architectures,  but  it  is 
not  obvious  that  it  provides  an  appropriate  basis  for  parallel  rendering.  There  are  two  main  problem 
areas: immediate mode rendering, and constraints on rendering order. Very few parallel renderers
provide  true  immediate  mode  rendering  for  individual  geometric  primitives.  With  dedicated  hardware, 
dropping  a  polygon  description  into  the  head  of  a  pipeline  and  waiting  for  the  pixels  to  pop  out  at  the 
other  end  is  a  natural  operation.  In  the  parallel  environment,  however,  the  overheads  involved  for 
synchronization,  communication,  and  image  updating  make  operations  at  this  fine  level  of  granularity 
very inefficient. Instead, parallel renderers generally accept an entire scene description (or similarly
coarse-grained  objects)  as  their  input,  and  often  employ  considerable  internal  buffering  of  intermediate 
results  to  obtain  satisfactory  performance. 
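
The contrast with immediate mode can be sketched with a hypothetical front end that merely records primitives as they are submitted and performs the expensive parallel work once per frame. The class and its `flush_fn` callback are illustrative assumptions and do not correspond to any existing renderer's interface.

```python
class BufferedRenderer:
    """Hypothetical front end that defers work until a whole frame is ready."""

    def __init__(self, flush_fn):
        self._pending = []
        self._flush_fn = flush_fn   # stands in for the costly parallel rendering step
        self.flushes = 0

    def draw(self, primitive):
        # No synchronization or communication here; just record the primitive.
        self._pending.append(primitive)

    def end_frame(self):
        # One synchronization and communication phase per frame,
        # rather than one per primitive.
        batch, self._pending = self._pending, []
        self._flush_fn(batch)
        self.flushes += 1
        return len(batch)
```

An application that issues thousands of `draw` calls per frame thus pays the parallel overheads only once, at `end_frame`, at the cost of the internal buffering mentioned above.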

OpenGL  also  insists  on  maintaining  the  order  of  rendered  primitives,  which  is  useful  for  “painting” 
objects  on  top  of  one  another.  In  the  parallel  environment,  the  concept  of  time  is  much  fuzzier,  requiring 
explicit  synchronization  operations  in  order  to  impose  global  order  (except  on  SIMD  systems,  where 
instruction-level  synchronization  is  provided  by  the  hardware).  Defining  and  enforcing  global  order  at 
the level of individual primitives is such a cumbersome proposition that parallel renderers have almost
universally  abandoned  it,  relying  instead  on  other  techniques  (such  as  multi-pass  rendering)  to  provide 
order  in  those  rare  cases  when  an  application  demands  it. 

We  may  also  need  to  extend  existing  APIs  to  accommodate  new  rendering  methods.  Volume 
rendering,  in  particular,  is  becoming  an  increasingly  important  and  accepted  tool,  in  part  because  parallel 
implementations  are  making  it  more  practical.  Thus  it  seems  useful  to  support  both  surface  rendering  and 
volume  rendering  within  a  single  API,  or  a  coordinated  set  of  APIs.  While  the  rendering  algorithms  for 
each of these techniques might be very different, there is a good chance that they could share much of the
back-end image handling and display infrastructure, perhaps allowing for volume-rendered and
polygon-rendered geometry to co-exist within a single frame.

4.4  High-level  Support  for  Parallel  Graphics  and  Visualization 

Although  computer  graphics  is  a  powerful  tool,  it  remains  underutilized  by  most  application 
programmers.  This  is  due  in  part  to  the  specialized  knowledge  which  is  needed  to  set  up  things  like 
viewing  parameters  and  modeling  transformations,  but  it  also  reflects  the  relatively  tedious  level  of  detail 
at  which  most  graphics  APIs  operate.  What  is  needed  is  a  more  intuitive  and  simpler  way  to  interface 
with the renderer, using operations on application-level data structures such as arrays or grids. When
displaying an isosurface becomes as easy as calling “printf”, computer graphics will come into its own as
a  tool  for  application  developers. 

Going  one  step  further,  we  can  imagine  the  insertion  of  compiler  directives  which  would  indicate  that  a 
particular  data  structure,  say  a  3D  grid,  is  to  be  displayed  visually  at  runtime.  By  enabling  an  appropriate 
compiler  option  or  runtime  switch,  a  GUI  interface  could  pop  up  with  a  variety  of  visualization  options, 
ranging  from  cutting  planes  and  isosurfaces  to  streamlines  and  volume  rendering.  While  none  of  this  is 
specific  to  the  parallel  environment,  the  power  of  a  parallel  system  can  make  high-level  visualization 
operations  more  responsive.  Compilers  for  languages  such  as  HPF  also  have  fairly  detailed  knowledge  of 
architectural  parameters  and  data  flow  within  an  application,  and  it  may  be  possible  to  exploit  this 
information  to  generate  efficient  code  for  graphical  operations. 

5  Conclusion 

With  some  effort,  software-based  rendering  techniques  can  be  successfully  used  on  current  parallel 
architectures  to  generate  visual  output  from  application  programs.  Improvements  in  usability  and 
performance  will  derive  from  a  broader,  system-level  view  of  the  rendering  process  and  its  various 
components.  Important  challenges  remain,  particularly  in  the  areas  of  scalability,  portability,  ease-of-use, 
and image handling. Meeting these challenges will allow software-based parallel renderers to evolve
from research curiosities into viable tools for production computing.